feat(pipeline): harvest related legislation discovered during enrichment by tdjager · Pull Request #901 · MinBZK/regelrecht

tdjager · 2026-07-02T18:59:55Z

Why

Regelingen reached only through the machine-readable model — a source.regulation binding, a legal_basis anchor, or an open delegation (open_terms/implements, e.g. regeling_standaardpremie filling wet_op_de_zorgtoeslag's standaardpremie) — are never auto-harvested. The recursive harvester follows only <extref> BWB hyperlinks in the source text, and those model links don't exist in the text — they are a product of enrichment. So the discovery must happen where the model is created: the enrichworker (which has corpus + DB). (A prior admin-side closure endpoint, #894, was closed — the deployed admin has no corpus.)

What

The enrichment agent now returns the related legislation it needs in a result envelope — a .enrichment-result.yaml sidecar the agent writes, kept deliberately out of the law YAML so the law artifact stays schema-conformant (extensible for future enrichment return-values). After a successful enrich, the worker:

reads the envelope's related_legislation,
resolves each entry to a BWB id — hybrid: agent bwb_id → law_entries slug lookup → single-hit SRU title search (>1 → needs_confirmation, skip),
enqueues a follow-up harvest per resolved id via the canonical dedup path.

Newly harvested laws auto-enrich (existing ENRICH_AUTO_ENQUEUE) and return their related legislation → the dependency graph fills itself by recursion.

Depth & priority

EnrichPayload.depth carries recursion depth harvest → enrich → harvest; each related harvest is depth+1 at priority = 40 − (depth+1), so deeper (more speculative) nesting yields to shallower/interactive harvests (Priority::new clamps 0..=100; queue is ORDER BY priority DESC).

Safety / guards

Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH (default 2).
Reuses ENRICH_DAILY_LIMIT (spend) and create_harvest_job_if_not_exists (dedup); harvest_exhausted skipped.
Best-effort: the whole block runs after the enrichment is committed and swallows every error into logs — it can never fail the enrichment or the worker loop.

Also

search_bwb_by_name(client, q) extracted from the axum handler (reused by both); find_bwb_id_by_slug made pub; slug hits re-validated as BWB (CVDR skipped, so no malformed harvest).
No law-schema change. .claude/skills/law-generate/SKILL.md instructs writing the envelope.

Known limitations

Shared depth counter with the extref harvester → the loop fires for roots/shallow laws; a related_depth field is the clean follow-up.
Unambiguous-but-loose SRU match accepted as-is (opt-in, harvest-only, lands on a reviewable enrich/{provider} branch).

Testing

Unit (52 pipeline tests green): envelope (de)serialization (full/missing/absent/malformed → empty), is_valid_bwb_id/slugify, related_harvest_priority (clamps, −1/level), depth round-trips.
End-to-end (to run on the preview): set HARVEST_RELATED_LEGISLATION=true, (re-)enrich wet_op_de_zorgtoeslag → confirm it writes .enrichment-result.yaml, a follow-up harvest for the delegated regeling is enqueued (depth 1, priority 39), harvested, then auto-enriched. Verify ENRICH_DAILY_LIMIT still bounds spend.

Regelingen reached only through the machine-readable model (source.regulation, legal_basis, and open_terms/implements delegations like regeling_standaardpremie) are never auto-harvested: the recursive harvester follows only <extref> BWB hyperlinks in the source text, and those model links are a product of enrichment. The enrichment agent now returns the related legislation it needs in a result envelope (.enrichment-result.yaml) — deliberately kept OUT of the law YAML so the law artifact stays schema-conformant. After a successful enrich the worker reads the envelope, resolves each entry to a BWB id (agent bwb_id -> law_entries slug -> single-hit SRU title search), and enqueues a follow-up harvest for each. Newly harvested laws auto-enrich and return their own related legislation, so the dependency graph fills itself by recursion. - EnrichPayload.depth carries recursion depth (harvest -> enrich -> harvest); each related harvest is depth+1 at priority 40-(depth+1), so deeper nesting yields to shallower/interactive harvests. - Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH (default 2); reuses ENRICH_DAILY_LIMIT for spend and create_harvest_job_if_not_exists for dedup. Best-effort: nothing here can fail the already-committed enrichment. - search_bwb_by_name extracted from the axum handler for reuse; find_bwb_id_by_slug made pub; slug hits re-validated as BWB (CVDR skipped). RFC-025 documents the pattern and its known limitations.

Drop the HARVEST_RELATED_LEGISLATION opt-in flag. The follow-up harvest only enqueues harvest jobs (no LLM cost); the expensive re-enrichment of those laws stays gated by ENRICH_AUTO_ENQUEUE + ENRICH_DAILY_LIMIT, and the recursion is bounded by RELATED_HARVEST_MAX_DEPTH. So there is nothing to protect behind a separate flag.

The enrichment result envelope (.enrichment-result.yaml) and the existing .enrichment.yaml metadata sidecar are written into a law directory but are not law files. `find -name '*.yaml'` matches leading-dot names and the pre-commit `files:` regex matches them too, so validating one fails (missing $id). Skip dot-prefixed sidecars in both script/validate.sh and the validate-law-yaml hook.

claude

This PR adds a feedback loop that lets the enrichment agent declare the related legislation it just modeled, then enqueues follow-up harvests so the dependency graph fills itself in recursively. The core design (sidecar envelope, hybrid BWB resolution, depth/priority capping, best-effort error handling) is sound, and the unit tests cover the main serialization and resolution paths well.

Two issues found in the new resolution logic:

🟠 Significant — CVDR slug hit falls through to SRU name search

packages/pipeline/src/worker.rs, in resolve_related_bwb_id

When step (b) finds the slug in law_entries but the stored id is a CVDR id (not a BWB id), the code logs "skipping" but does not return — execution falls through to step (c), the SRU name search on wetten.overheid.nl.

Ok(Some(id)) if is_valid_bwb_id(&id) => return RelatedResolution::Resolved(id),
Ok(Some(id)) => {
    tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
    // no return here — falls through to SRU search
}

law_entries does store CVDR ids (confirmed — the harvest worker calls upsert_law with the CVDR job.law_id). A law named e.g. "Verordening X" can have a slug hit mapping to a CVDR id, then a single SRU hit for a different national law with a similar name, producing Resolved(wrong_bwb_id) where Unresolved was intended. The comment itself says "skip non-BWB hits" — the code should do exactly that:

Ok(Some(id)) => {
    tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
    return RelatedResolution::Unresolved;  // add this
}

🟡 Minor — `resolved` counter inflated by exhausted-skipped laws

packages/pipeline/src/worker.rs, in harvest_related_legislation

resolved += 1 fires before the HarvestExhausted guard. When a law is exhausted, the loop continues without incrementing enqueued, so the summary log silently folds both "exhausted-skipped" and "already-queued (dedup)" into the resolved − enqueued gap with no way to distinguish them.

resolved += 1;              // ← incremented here
if let Ok(law) = law_status::get_law(pool, &bwb_id).await {
    if law.status == LawStatusValue::HarvestExhausted {
        continue;           // enqueued not incremented — gap is ambiguous
    }
}

Moving resolved += 1 to after the exhausted check, or adding a dedicated exhausted counter to the summary log, would make the log actionable.

🟡 Minor — blocking I/O in async context

packages/pipeline/src/enrich.rs:438

read_enrichment_result_envelope uses std::fs::read_to_string (synchronous) and is called directly from the async execute_enrich_with_runner. The surrounding code already uses tokio::fs::write. In practice the file is tiny and arrives after a minutes-long LLM call, so the impact is negligible on normal disk — but under a slow filesystem (NFS, container overlay under memory pressure) this blocks the Tokio worker thread indefinitely. Wrapping in tokio::task::spawn_blocking or switching to tokio::fs::read_to_string would match the pattern used elsewhere in the file.

Address CI review findings: - A CVDR slug hit no longer falls through to the SRU name search (the slug already identified the law; a title match could resolve a *different* national law). It now returns Unresolved. - The harvest summary log separates already_queued and exhausted skips instead of conflating them in the resolved-but-not-enqueued gap. Also drop RFC-025: the related-legislation harvest loop is an implementation detail, not a cross-cutting design decision that warrants an RFC.

tdjager · 2026-07-02T20:19:29Z

Both review findings addressed in 4a24029: the CVDR slug hit now returns Unresolved instead of falling through to the SRU name search, and the harvest summary log separates already_queued/exhausted skips. Also dropped RFC-025 (this is an implementation detail, not an RFC).

claude

Pipeline: harvest related legislation discovered during enrichment

Correctness

🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rs — the resolve_related_bwb_id function resolves via three paths:

(a) explicit bwb_id: guarded by is_valid_bwb_id
(b) slug lookup: guarded by is_valid_bwb_id
(c) SRU single hit: results[0].bwb_id.clone() returned directly, no validation

BwbSearchResult.bwb_id comes from parse_sru_response which parses wetten.overheid.nl SRU output — in practice that's always a BWBR ID — but the inconsistency means a malformed response would propagate an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently, but it's an easy one-liner fix:

Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
    RelatedResolution::Resolved(results[0].bwb_id.clone())
}

🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438 — read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny YAML sidecar the impact is negligible, but it's inconsistent:

// current (sync, blocks runtime thread)
let content = match std::fs::read_to_string(&envelope_path) { … }

// idiomatic in this async context
let content = match tokio::fs::read_to_string(&envelope_path).await { … }

Requires making read_enrichment_result_envelope async and threading .await through the caller.

🟡 Minor — depth comment at harvest→enrich handoff is misleading
packages/pipeline/src/worker.rs — the comment at depth: payload.depth:

// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,

payload here is a HarvestPayload whose depth is the extref recursion counter, not the related-legislation chain counter. The comment implies only related chains inherit this value. In reality, any harvest at extref depth ≥ related_harvest_max_depth (2 by default) will have its enrichment silently skip related-legislation follow-ups — the PR description acknowledges this as a known limitation but the comment in code doesn't signal it. Worth expanding:

// Inherit the harvest's depth (extref OR related-chain). This
// conflates both depth counters — a law reached via ≥2 extref hops
// won't get its related legislation harvested. Clean-up: a separate
//  field on HarvestPayload.
depth: payload.depth,

Rest of the diff

bwb_search.rs extraction of search_bwb_by_name: clean refactor, error mapping is correct.
config.rs / env parsing: parse().ok() silently falls back to 2 on any bad value — appropriate.
enrich.rs envelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct.
.pre-commit-config.yaml / validate.sh: exclusion regexes are correct and consistent.
SKILL.md Phase 4.5: clear instructions; the CRITICAL warning not to put the envelope in the law YAML is exactly right.
Unit tests cover serialization, absent/malformed envelopes, clamping, and slug normalization. Good coverage.

No critical or significant issues.

claude

Pipeline: harvest related legislation discovered during enrichment

Correctness

🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rs — resolve_related_bwb_id guards path (a) and (b) with is_valid_bwb_id, but path (c) returns results[0].bwb_id.clone() directly:

Ok(results) if results.len() == 1 => RelatedResolution::Resolved(results[0].bwb_id.clone()),

BwbSearchResult.bwb_id comes from parse_sru_response parsing wetten.overheid.nl SRU output — in practice always a BWBR ID — but the inconsistency means a malformed API response propagates an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently; easy fix:

Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
    RelatedResolution::Resolved(results[0].bwb_id.clone())
}

🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438 — read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny sidecar the impact is negligible, but it's inconsistent with the surrounding code.

🟡 Minor — depth comment at harvest→enrich handoff understates the conflation
packages/pipeline/src/worker.rs:

// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,

payload here is a HarvestPayload whose depth is the extref recursion counter. The comment implies only related chains inherit this value. In reality any law reached via ≥ RELATED_HARVEST_MAX_DEPTH extref hops will silently have its related-legislation follow-ups skipped. The PR description documents this as a known limitation; the in-code comment should say so too.

Rest of the diff

bwb_search.rs extraction of search_bwb_by_name: clean refactor, error mapping is correct.
config.rs env parsing: parse().ok() silently falls back to 2 on bad input — appropriate for a tuning knob.
enrich.rs envelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct.
.pre-commit-config.yaml / validate.sh: exclusion patterns are correct and consistent with each other.
SKILL.md Phase 4.5: clear instructions; the CRITICAL warning not to put the envelope inside the law YAML is exactly right.
Unit tests cover serialization, absent/malformed envelopes, priority clamping, and slug normalization — good coverage.

No critical or significant issues.

- Validate the single-hit SRU result as a BWB id before resolving, so a malformed SRU id can't slip into a harvest payload (paths a/b already validate). - Read the .enrichment-result.yaml sidecar via tokio::fs (was blocking std::fs) for consistency with the rest of execute_enrich_with_runner. - Clarify the depth-inherit comment: the field is the shared extref-recursion counter, so deep-via-extref laws skip related discovery (roots/shallow laws, the intended case, are unaffected).

tdjager added 3 commits July 2, 2026 18:58

tdjager marked this pull request as ready for review July 2, 2026 20:01

claude Bot reviewed Jul 2, 2026

View reviewed changes

github-actions Bot deleted a comment from claude Bot Jul 2, 2026

claude Bot reviewed Jul 2, 2026

View reviewed changes

github-actions Bot deleted a comment from claude Bot Jul 2, 2026

tdjager merged commit d2992e9 into main Jul 2, 2026
30 checks passed

tdjager deleted the feat/enrich-related-legislation branch July 2, 2026 20:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(pipeline): harvest related legislation discovered during enrichment#901

feat(pipeline): harvest related legislation discovered during enrichment#901
tdjager merged 5 commits into
mainfrom
feat/enrich-related-legislation

tdjager commented Jul 2, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

tdjager commented Jul 2, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tdjager commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Depth & priority

Safety / guards

Also

Known limitations

Testing

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

🟠 Significant — CVDR slug hit falls through to SRU name search

🟡 Minor — resolved counter inflated by exhausted-skipped laws

🟡 Minor — blocking I/O in async context

Uh oh!

tdjager commented Jul 2, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Pipeline: harvest related legislation discovered during enrichment

Correctness

Rest of the diff

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Pipeline: harvest related legislation discovered during enrichment

Correctness

Rest of the diff

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tdjager commented Jul 2, 2026 •

edited

Loading

🟡 Minor — `resolved` counter inflated by exhausted-skipped laws